Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) #1212

Open
Gusanidas wants to merge 8 commits into openai:main from Gusanidas:alejandro/ksv2-v3-submission

Conversation


Gusanidas commented Apr 1, 2026

Record: Window Attention + Mixed Seq_Len Training

val_bpb: 1.1108 (5-seed mean, std 0.0013) | 1.8755 nats | ~15.73 MB | 8xH100 SXM, 600s | No TTT

I started from PR #1130 (KitchenSinkV2 Improved), which added split early/late LR banks, MiLe margin loss, cache+backout residual, residual lambdas, bigger bigram/VE, and FA3 on top of the PR #549 stack. On top of that, I ported the fused Triton MLP from PR #1072 and the sigmoid-gated skips + brotli+byte-shuffle compression from PR #1089. I also increased to 12 layers and tuned qk_gain to 2.5.
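As background on the ported brotli+byte-shuffle compression path: byte-shuffling groups bytes of equal significance together so the entropy coder sees longer matchable runs. A minimal sketch of just the shuffle/unshuffle step (illustrative names; the actual PR #1089 code pairs this with brotli on the shuffled buffer):

```python
def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    # Transpose the buffer: all byte-0s of each item first, then all
    # byte-1s, etc. Same-significance bytes tend to be similar, which
    # helps the downstream compressor (brotli in the submission).
    assert len(data) % itemsize == 0
    n = len(data) // itemsize
    return bytes(data[i * itemsize + b] for b in range(itemsize) for i in range(n))

def byte_unshuffle(data: bytes, itemsize: int) -> bytes:
    # Inverse transpose: reassemble each item from its scattered bytes.
    assert len(data) % itemsize == 0
    n = len(data) // itemsize
    return bytes(data[b * n + i] for i in range(n) for b in range(itemsize))
```

The roundtrip is lossless; only the compressed size changes.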

The two main contributions of this submission are window attention and mixed seq_len training, described below.

Results (8xH100 80GB SXM, 600s, no TTT)

| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 8,428 | 69.6 | 1.1250 | 1.1094 | 1.8731 | 15,726,762 |
| 1337 | 8,428 | 69.6 | 1.1250 | 1.1101 | 1.8742 | 15,721,698 |
| 42 | 8,428 | 69.6 | 1.1250 | 1.1103 | 1.8746 | 15,725,995 |
| 7 | 8,428 | 69.6 | 1.1250 | 1.1119 | 1.8773 | 15,723,346 |
| 22 | 8,428 | 69.6 | 1.1250 | 1.1126 | 1.8785 | 15,720,902 |
| **Mean** | | | | 1.1108 | 1.8755 | 15,723,741 |

Current merged SOTA (2026-03-25 AR Self-Gen GPTQ + XSA-all + BigramHash 3072x112): 1.11473 BPB.
Delta vs current merged SOTA: -0.0039 BPB (-0.0066 nats).

Window attention

Instead of full causal attention on every layer, layers 2, 4, 6, 8, and 10 use a sliding window of 512 tokens via Flash Attention 3's window_size parameter. The remaining layers (0, 1, 3, 5, 7, 9, 11) keep full attention.
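FA3's `window_size` parameter implements this fused; purely to pin down the masking semantics, a sliding-window causal mask can be sketched as follows (illustrative, not the submission's code; off-by-one conventions differ between implementations):

```python
def sliding_window_causal_mask(seq_len: int, window: int):
    # allowed[i][j] is True when query i may attend to key j:
    # causal (j <= i) and within the last `window` positions (i - j < window).
    return [[j <= i and i - j < window for j in range(seq_len)]
            for i in range(seq_len)]
```

With `window >= seq_len` this reduces to the ordinary causal mask, which is why full-attention and windowed layers can share one code path.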

The motivation was to enable training at longer sequence lengths without proportionally increasing compute. Full quadratic attention at seq_len=6144 is expensive, but with window attention on 5 of 12 layers, those layers run in O(n * w) instead of O(n^2), cutting the per-step cost significantly. The layers with full attention still give the model access to the full context.
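The scaling claim can be checked with back-of-envelope counts of (query, key) pairs per layer (attention scores only, ignoring MLPs and constant factors; numbers are illustrative):

```python
def attn_score_elems(seq_len, window=None):
    # Number of (query, key) pairs attended per layer:
    # full causal: n(n+1)/2; windowed: each query sees at most `window` keys.
    if window is None:
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))

full = attn_score_elems(6144)        # 18,877,440 pairs
win = attn_score_elems(6144, 512)    #  3,014,912 pairs
```

At n=6144 a windowed layer touches roughly 16% of the pairs of a full-attention layer, consistent with the O(n·w) vs O(n²) argument above.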

I swept several configurations: window sizes (256, 512, 1024), which layers to window (sparse, dense, even), and how many. Window 512 on the even-indexed layers from 2 upward was the sweet spot — enough windowed layers to get the speedup, enough full-attention layers to preserve long-range modeling.

At seq_len=2048 the quadratic term is still cheap, so windowed attention adds a small overhead (~2-3%) rather than a saving. The benefit kicks in at longer sequences: 15% faster at 4096, 21% at 6144, 25% at 8192.

Mixed seq_len training

Different GPUs train with different sequence lengths within the same step. In the final configuration, 5 GPUs train at seq_len=2048 and 3 GPUs train at seq_len=6144. The number of sequences per GPU is set so that the total ms per step stays roughly constant.
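The balancing arithmetic can be sketched with the final submission's per-GPU counts (36 sequences at 2048, 10 at 6144, from the Reproducibility section); the helper name is illustrative:

```python
def tokens_per_gpu(seq_lens, seqs_per_gpu):
    # Per-step token count on each rank; counts are chosen so that step
    # time (tokens plus attention-cost differences) is roughly level.
    return [n * l for l, n in zip(seq_lens, seqs_per_gpu)]

# 5 GPUs at 36 x 2048-token seqs, 3 GPUs at 10 x 6144-token seqs
loads = tokens_per_gpu([2048] * 5 + [6144] * 3, [36] * 5 + [10] * 3)
```

The long-sequence ranks carry fewer tokens (61,440 vs 73,728) because their attention cost per token is higher, even with windowed layers.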

The idea came from noticing that the sliding-window eval (which uses long sequences) gave substantially better scores than the standard 2048-token eval, but training at long sequence lengths was slow. By having most GPUs train cheaply at 2048 and a few GPUs see long context at 6144, the model gets the best of both: high step throughput from the short-sequence GPUs and long-range learning from the long-sequence ones.
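For context, the sliding-window eval mentioned here scores long documents in overlapping windows, counting loss only on the fresh tail of each window so every token is scored once with long left-context (the submission evaluates at seq_len=6144, stride=128). A schematic of the index bookkeeping, under those assumptions:

```python
def eval_windows(doc_len, seq_len, stride):
    # Yields (window_start, window_end, score_start): the model runs on
    # [window_start, window_end), but only tokens in [score_start, window_end)
    # contribute to the loss, so each token is scored exactly once.
    out, pos, scored_to = [], 0, 0
    while scored_to < doc_len:
        end = min(pos + seq_len, doc_len)
        out.append((pos, end, scored_to))
        scored_to = end
        pos += stride
    return out
```

With stride much smaller than seq_len, almost every scored token sees nearly seq_len of left-context, which is where the long-sequence training pays off.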

I ran an extensive sweep of seq_len combinations. Some findings:

  • 3x2048 + 1x6144 (eval at 6144) gave the best int6 roundtrip BPB (1.1292) in 4-GPU experiments, beating both pure 4x2048 (1.1417) and pure 4x6144 (1.1360)
  • Having at least one GPU on a long sequence (4096+) was critical for good quantized performance
  • More short-sequence GPUs = more steps in the same wallclock, which helps training loss
  • More long-sequence GPUs = better post-EMA loss, but fewer steps and worse quantization
  • 8192 was too slow to be worthwhile — the step-time penalty outweighed the context benefit

For the final 8-GPU submission, I used 5x2048 + 3x6144, which balances throughput and long-context exposure.

Other changes

Artifact size (worst-case, seed 2)

| Component | Bytes |
| --- | --- |
| Model (int6+brotli) | 15,692,661 |
| Code | 34,101 |
| **Total** | 15,726,762 |

Under the 16,000,000 byte limit.

Acknowledgments

This submission builds on many contributions from the parameter-golf community:

Reproducibility

The main training runs used the following command:

SEED=$SEED \
MATRIX_LR=0.024 MATRIX_LR_LATE=0.019 \
SCALAR_LR=0.020 SCALAR_LR_LATE=0.038 \
TIED_EMBED_LR=0.022 \
MUON_MOMENTUM=0.985 WARMDOWN_ITERS=4000 \
TRAIN_BATCH_TOKENS=589824 \
NUM_LAYERS=12 BIGRAM_VOCAB_SIZE=5120 VE_DIM=128 \
WINDOW_SIZE=512 WINDOW_ATTN_LAYERS=2,4,6,8,10 \
LOCAL_SEQS_PER_GPU=36,36,36,36,36,10,10,10 \
SEQS_PER_GPU=2048,2048,2048,2048,2048,6144,6144,6144 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

brotli needs to be installed for the final artifact compression path. Flash Attention 3 (flash_attn_interface) is required.
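The per-GPU lists in the command above are comma-separated and presumably indexed by rank (`SEQS_PER_GPU` carrying sequence lengths, `LOCAL_SEQS_PER_GPU` carrying sequence counts). A hypothetical sketch of how a rank might read its slot — the helper name is mine, not the repo's:

```python
import os

def per_rank_int(var, rank, default):
    # Read a comma-separated env var like SEQS_PER_GPU=2048,...,6144
    # and return this rank's entry, falling back to a default.
    raw = os.environ.get(var)
    if not raw:
        return default
    return int(raw.split(",")[rank])

os.environ["SEQS_PER_GPU"] = "2048,2048,2048,2048,2048,6144,6144,6144"
seq_len = per_rank_int("SEQS_PER_GPU", 5, 2048)  # rank 5 trains at 6144
```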

Gusanidas and others added 7 commits April 1, 2026 06:42
12-layer split-bank U-Net with window attention (size=512 on layers
2,4,6,8,10), mixed seq_len training (5 GPUs at 2048 + 3 GPUs at 6144),
fused Triton LeakyReLU-squared MLP, sigmoid-gated skip connections,
brotli+byte-shuffle compression, GPTQ int6, sliding window eval
(stride=128, seq_len=6144).

5-seed results: 1.1094, 1.1101, 1.1103, 1.1119, 1.1126 (mean 1.1108)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Gusanidas Gusanidas changed the title Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 Record: Window Attention + Mixed Seq_Len Training, bpb 1.1108, eval at 6144 (5-seed mean) Apr 1, 2026
icryo added a commit to icryo/parameter-golf that referenced this pull request Apr 2, 2026
Window attention (PR openai#1212 approach):
- Layers 2,4,6,8,10: sliding window size=512
- Remaining layers: full causal attention
- Enables training at seq_len=6144 without proportional cost increase

Mixed seq_len training (PR openai#1212 approach):
- 5 GPUs at seq_len=2048 (high throughput)
- 3 GPUs at seq_len=6144 (long context learning)
- Best of both: many steps + long-range modeling

Eval at seq_len=6144, stride=128 (vs 2048/64 before)
- Longer eval context = better compression = lower BPB

All controlled by env vars, defaults preserve original behavior.
PR openai#1212 showed -0.031 BPB from this approach on sp1024.
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Thesis: the speed path is the most underutilized section of openai/parameter-golf.
The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties.
Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses
under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in
  plain sight. We're paying 8x kernel-launch overhead because grad_accum was
  inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is
  625K sequential forwards at B=1 stride=64. 97% of each window's context is
  shared with the previous. Streaming KV (StreamingLLM arXiv 2309.17453) gives
  5-15x eval speedup, saves 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025 arXiv
  2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35
  backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved).
  Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the
  fastest step in the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4
  contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms
  → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity
  (PRs openai#1105, openai#1420). Identity itself looks world-novel.
- Shots 13-14: eval path wins (Triton KV-cache backend, fused softcap+CE
  megakernel). Combined eval speedup ~5x on top of Shot 0b.

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent
  SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels;
  nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of
  our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens
  templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — user's "dial-in" hint
  operationalized. Thompson sampling from {0.5x, 1x, 2x} * base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — user's "CPU while GPU" hint
  operationalized. BG thread pre-computes n-gram hash tensors, 50 LOC.
- Megadream 5: **GPU-resident successive halving** — user's "GPU tests" hint
  operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick
  winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min
  compile cold-start permanently.

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100.
That's where val_bpb drops BELOW comp records.

Key finding: eval path holds the biggest speed wins currently, not training.
Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2
Shots 13-14 save 5-8 min per eval pass. More than any training-side single
patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed),
/tmp/phase2_world_speed_research.md (12 research areas surveyed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Documents what's actually in the repo now:

SHIPPED:
- phase2/README.md, bootstrap.sh, metrics.py, warm_compile_cache.py, run.sh
- submission/run.sh: Inductor patch + CUDA allocator expandable segments
- submission/train.py ShuffledSequenceLoader: prefetch thread + pinned RAM
  + prefill during pretime
- All gated by env vars with sensible defaults on

NOT SHIPPED (future work):
- Shot 2 FA3 sourcing (not on PyPI)
- Shot 9 FA3 varlen + window attention (PR openai#1212)
- Shot 10 Parameter Banking + Parallel Muon (PR openai#399)
- Shot 14 Training megakernel (world-first)
- Shot 0b batched + streaming KV sliding eval
- Shot 17 fuzzy LR bandit
- Shot 19 GPU-resident successive halving

HONEST SKIPS:
- grad_accum 8→1: research agent missed memory math, would OOM
- CPU n-gram precompute: research agent missed GPU HBM is 60× faster than
  CPU→GPU PCIe path for gather ops. Pivoted to prefetch prefill instead.

Tasks 7-12 complete (metrics, free env wins, prefetch loader, compile cache
warmup, prefill during pretime, bootstrap wiring). Phase 2 Tier 0 is
mechanically shipped. Still a plan for the bigger shots.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>